A new model for Persian multi-part words edition based on statistical machine translation

Authors

A. Arjomandzadeh, School of Computer Engineering & Information Technology, University of Shahrood, Shahrood, Iran.

M. Zahedi, School of Computer Engineering & Information Technology, University of Shahrood, Shahrood, Iran.

Abstract:

In English, multi-part words are hyphenated, with the hyphen separating the parts. Persian also has multi-part words. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common misuse of the space character causes serious problems for Persian text processing and text readability. To address these problems, this work proposes a new model for correcting spacing in multi-part words. The proposed method is based on the statistical machine translation paradigm, in which text in a source language is translated into text in a destination language using statistical models whose parameters are derived from the analysis of bilingual text corpora. The method applies statistical machine translation techniques, treating unedited multi-part words as the source language and space-edited multi-part words as the destination language. The results show that the proposed method can edit and improve the spacing of Persian multi-part words with a statistically significant accuracy rate.
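To make the source/destination framing concrete, the following is a minimal Python sketch, not the authors' system. It assumes the half-space is encoded as the zero-width non-joiner (U+200C), builds (unedited, edited) pairs from correctly spaced text, and replaces the statistical translation model with a simple count-based rule. The function names `make_parallel_pair`, `learn_joins`, and `correct`, and the two-sentence corpus, are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumption: the Persian "half-space" is the zero-width non-joiner.
ZWNJ = "\u200c"


def make_parallel_pair(edited: str) -> tuple[str, str]:
    """Return (source, target): the source side has every half-space
    replaced by a plain space, mimicking the common typing error."""
    return edited.replace(ZWNJ, " "), edited


def learn_joins(edited_sentences):
    """Count, for each adjacent pair of word parts, how often the pair is
    joined by a half-space and how often it is separated by a space."""
    joined, spaced = Counter(), Counter()
    for sentence in edited_sentences:
        words = sentence.split(" ")
        for word in words:
            parts = word.split(ZWNJ)
            for left, right in zip(parts, parts[1:]):
                joined[(left, right)] += 1
        for a, b in zip(words, words[1:]):
            # Boundary between two space-separated words.
            spaced[(a.split(ZWNJ)[-1], b.split(ZWNJ)[0])] += 1
    return joined, spaced


def correct(unedited: str, joined, spaced) -> str:
    """Rewrite each space as a half-space when the training counts favor it.
    This is a crude stand-in for decoding with a translation model."""
    tokens = unedited.split(" ")
    out = tokens[0]
    for prev, cur in zip(tokens, tokens[1:]):
        sep = ZWNJ if joined[(prev, cur)] > spaced[(prev, cur)] else " "
        out += sep + cur
    return out


# Toy "destination language" corpus of correctly spaced text (hypothetical).
corpus = [
    "می" + ZWNJ + "روم به خانه",
    "کتاب" + ZWNJ + "ها را می" + ZWNJ + "خوانم",
]
joined, spaced = learn_joins(corpus)
source, target = make_parallel_pair(corpus[0])
restored = correct(source, joined, spaced)
```

A real statistical machine translation setup would estimate alignment and language-model probabilities from many such pairs instead of comparing raw counts; the sketch only shows how the two "languages" relate.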

Similar resources

A Multi-Domain Translation Model Framework for Statistical Machine Translation

While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, al...

Statistical Machine Translation as a Grammar Checker for Persian Language

Existence of automatic writing assistance tools such as spell and grammar checker/corrector can help in increasing electronic texts with higher quality by removing noises and cleaning the sentences. Different kinds of errors in a text can be categorized into spelling, grammatical and real-word errors. In this article, the concepts of an automatic grammar checker for Persian (Farsi) language, is...

Deeper than Words: Morph-based Alignment for Statistical Machine Translation

In this paper we introduce a novel approach to alignment for statistical machine translation. The core idea is to align subword units, or morphs, instead of word forms. This results in a joint segmentation and alignment model, aimed to improve translation quality for morphologically rich languages and reduce the size of the required parallel corpora. Here we focus on translating from inflection...

A Sense-Based Translation Model for Statistical Machine Translation

The sense in which a word is used determines the translation of the word. In this paper, we propose a sense-based translation model to integrate word senses into statistical machine translation. We build a broad-coverage sense tagger based on a nonparametric Bayesian topic model that automatically learns sense clusters for words in the source language. The proposed sense-based translation model...

A Novel Reordering Model Based on Multi-layer Phrase for Statistical Machine Translation

Phrase reordering is of great importance for statistical machine translation. According to the movement of phrase translation, the pattern of phrase reordering can be divided into three classes: monotone, BTG (Bracket Transduction Grammar) and hierarchy. It is a good way to use different styles of reordering models to reorder different phrases according to the characteristics of both the reorde...

Journal title

Journal of Artificial Intelligence and Data Mining

Volume 4, Issue 1

Pages 27-34

Publication date: 2016-01-01

Keywords

Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules